NO-ISSUE: Restrict test retries to an allowlist of test names#31222
Conversation
WalkthroughThe pull request refactors test retry eligibility in ChangesTest Retry Allowlist Refactor
🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 14 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (14 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/test/ginkgo/retries_test.go`:
- Around line 101-107: The test TestLoadRetryAllowedJobs is not detecting parse
failures because loadRetryAllowedJobs hides errors; change the loader to expose
parse errors and update the test to assert them: modify loadRetryAllowedJobs (or
add a new helper like loadRetryAllowedJobsWithError) to return (map[string]bool,
error) and have the test call that function, assert error == nil and the
returned map is non-nil/non-empty; update callers accordingly (or keep a thin
wrapper loadRetryAllowedJobs that panics on error for production code) so the
test will fail when the embedded YAML is malformed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 9a11f0a6-4dbc-4d47-8df4-a2e6327e42d7
📒 Files selected for processing (3)
pkg/test/ginkgo/retries.gopkg/test/ginkgo/retries_test.gopkg/test/ginkgo/retry_allowed_jobs.yaml
9d45683 to
0412516
Compare
|
@coderabbitai help |
ChatThere are 3 ways to chat with CodeRabbit:
CodeRabbit commands
Other keywords and placeholders
Status, support, documentation and community
|
|
@coderabbitai summary |
|
✅ Actions performedSummary regeneration triggered. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/test/ginkgo/retries.go`:
- Around line 16-37: The current code implements a test-level allowlist
(retryAllowedTestsYAML, retryAllowedTestsConfig, loadRetryAllowedTests) but the
PR requires a job-level allowlist driven by JOB_NAME; update the implementation
to match by adding a job-level loader (e.g., loadRetryAllowedJobs reading
retry_allowed_jobs.yaml into a struct like Jobs []string) and change the retry
decision logic to consult os.Getenv("JOB_NAME") against that allowlist before
allowing retries, or alternatively update the PR text to state this is
test-level behavior; locate uses of loadRetryAllowedTests and the retry decision
point and replace or supplement them to check JOB_NAME membership in the new
jobs set.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 3d187d45-503b-4721-9d24-973c1daa5c5b
📒 Files selected for processing (3)
pkg/test/ginkgo/retries.gopkg/test/ginkgo/retries_test.gopkg/test/ginkgo/retry_allowed_tests.yaml
|
/cc @dgoodwin |
|
The query claude came up with is better than I would have, I did not know that table exists, and it's perfect for our purposes. I suggest this query: SELECT
rs.TestName,
COUNT(*) AS flake_count,
COUNT(DISTINCT rs.JobName) AS distinct_jobs_affected,
SUM(rs.TotalAttempts) AS total_attempts,
SUM(rs.FailedAttempts) AS total_failed_attempts
FROM `openshift-ci-data-analysis.ci_data_autodl.retry_statistics` rs
INNER JOIN `openshift-gce-devel.ci_analysis_us.job_variants` jv
ON rs.JobName = jv.job_name
AND jv.variant_name = 'Release'
AND jv.variant_value = '4.22'
WHERE rs.FinalOutcome = 'flaky'
AND rs.TestName IS NOT NULL AND rs.TestName != ''
AND rs.PartitionTime >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY rs.TestName
HAVING COUNT(*) >= 5
ORDER BY flake_count DESCWe're not concerned with flakes in old releases, and filtering to the past 30 days seems logical. 4.22 has the best data over the past month as it's the most stable. If a test flaked only once in the last month, I'm fine removing it's right to flake. We may have to add more to this list over time, things could surface in CR, but hopefully that will be rare. 5 seems like a good threshold, and that should be about 207 tests. Threshold options from claude: ┌───────────┬────────────┬────────────┬──────────────────────┐ We could go lower though, 2-3 would also be reasonable IMO. |
|
|
||
| func NewRetryOnceStrategy() *RetryOnceStrategy { | ||
| return &RetryOnceStrategy{ | ||
| PermittedRetryImageTags: []string{"tests"}, // tests = openshift-tests image |
There was a problem hiding this comment.
@stbenjam this looks safe to me, but would appreciate if you can confirm. Mat is closing the retry gap at last, with an explicit list of tests that need retries only allowed.
| "[sig-node] Pods Extended Pod Container lifecycle evicted pods should be terminal [Suite:openshift/conformance/parallel] [Suite:k8s]", // https://issues.redhat.com/browse/OCPBUGS-57658 | ||
| "[sig-cli] Kubectl logs all pod logs the Deployment has 2 replicas and each pod has 2 containers should get logs from each pod and each container in Deployment [Suite:openshift/conformance/parallel] [Suite:k8s]", // https://issues.redhat.com/browse/OCPBUGS-61287 | ||
| "[sig-cli] Kubectl Port forwarding Shutdown client connection while the remote stream is writing data to the port-forward connection port-forward should keep working after detect broken connection [Suite:openshift/conformance/parallel] [Suite:k8s]", // https://issues.redhat.com/browse/OCPBUGS-61734 | ||
| "[sig-storage] OCP CSI Volumes [Driver: csi-hostpath-groupsnapshot] [OCPFeatureGate:VolumeGroupSnapshot] [Testpattern: (delete policy)] volumegroupsnapshottable [Feature:volumegroupsnapshot] VolumeGroupSnapshottable should create snapshots for multiple volumes in a pod", // https://issues.redhat.com/browse/OCPBUGS-66967 |
There was a problem hiding this comment.
Please check if all these are in the list with the new query I added in the PR, if not we should add them.
Also could you ensure the list is sorted, the query I provided isn't. :)
There was a problem hiding this comment.
The list is now sorted. The "migrated" whitelisted tests are there. After running your revised query we are down to ~200 tests only.
|
Code looks good, just a revision to the query and some sorting concerns. |
0412516 to
8b792da
Compare
Replace the image-tag-based and hardcoded retry eligibility logic in shouldRetryTest() with a single YAML allowlist (retry_allowed_tests.yaml) that is the sole source of truth for which tests may retry on failure. Previously, any test from a permitted image tag could retry, and 6 specific tests had hardcoded exceptions. Now, a test must be explicitly listed in the YAML file to get a retry. Tests not on the list must pass on the first attempt. The allowlist is populated from BigQuery retry_statistics data: all 1395 tests that historically failed first but passed on retry are included. The 6 previously hardcoded exception tests (OCPBUGS-57477, OCPBUGS-57665, OCPBUGS-57658, OCPBUGS-61287, OCPBUGS-61734, OCPBUGS-66967) are covered by the BigQuery data and no longer need special treatment in Go code. The list should be reduced over time as flaky tests are fixed. To disable retries entirely, empty the YAML list.
8b792da to
cf2226e
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/test/ginkgo/retry_allowed_tests.yaml (1)
8-23: ⚡ Quick winConsider enriching the documented query with additional metrics.
The BigQuery query documents how to regenerate the allowlist, but it only returns
TestNameandflake_count. Based on the PR comments, a more informative query would also includedistinct_jobs_affected,total_attempts, andtotal_failed_attemptsto help maintainers make better decisions when regenerating the list.📊 Suggested enhanced query
SELECT rs.TestName, COUNT(*) AS flake_count, COUNT(DISTINCT rs.JobName) AS distinct_jobs_affected, SUM(rs.TotalAttempts) AS total_attempts, SUM(rs.FailedAttempts) AS total_failed_attempts FROM `openshift-ci-data-analysis.ci_data_autodl.retry_statistics` rs INNER JOIN `openshift-gce-devel.ci_analysis_us.job_variants` jv ON rs.JobName = jv.job_name AND jv.variant_name = 'Release' AND jv.variant_value = '4.22' WHERE rs.FinalOutcome = 'flaky' AND rs.TestName IS NOT NULL AND rs.TestName != '' AND rs._PARTITIONTIME >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY) GROUP BY rs.TestName HAVING COUNT(*) >= 5 ORDER BY flake_count DESCThis provides better visibility into the scope and impact of each flaky test.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/test/ginkgo/retry_allowed_tests.yaml` around lines 8 - 23, Update the documented BigQuery in pkg/test/ginkgo/retry_allowed_tests.yaml (the query that selects from retry_statistics aliased as rs and joins job_variants as jv) to include the additional metrics: COUNT(DISTINCT rs.JobName) AS distinct_jobs_affected, SUM(rs.TotalAttempts) AS total_attempts, and SUM(rs.FailedAttempts) AS total_failed_attempts; keep the existing WHERE, JOIN, GROUP BY rs.TestName, HAVING COUNT(*) >= 5 and ordering by flake_count so the query returns TestName, flake_count plus the three new aggregate columns to provide better context for maintainers.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@pkg/test/ginkgo/retry_allowed_tests.yaml`:
- Around line 8-23: Update the documented BigQuery in
pkg/test/ginkgo/retry_allowed_tests.yaml (the query that selects from
retry_statistics aliased as rs and joins job_variants as jv) to include the
additional metrics: COUNT(DISTINCT rs.JobName) AS distinct_jobs_affected,
SUM(rs.TotalAttempts) AS total_attempts, and SUM(rs.FailedAttempts) AS
total_failed_attempts; keep the existing WHERE, JOIN, GROUP BY rs.TestName,
HAVING COUNT(*) >= 5 and ordering by flake_count so the query returns TestName,
flake_count plus the three new aggregate columns to provide better context for
maintainers.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 89a30401-3429-4f1c-8564-5069943c9079
📒 Files selected for processing (3)
pkg/test/ginkgo/retries.gopkg/test/ginkgo/retries_test.gopkg/test/ginkgo/retry_allowed_tests.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/test/ginkgo/retries_test.go
- pkg/test/ginkgo/retries.go
|
Scheduling required tests: |
|
/lgtm Nice when one of our strategic initiatives is closed out in a couple hours and one PR :) Thanks! |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dgoodwin, mkowalski The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@mkowalski: This pull request explicitly references no jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/test e2e-metal-ipi-ovn-ipv6 |
|
/verified by later |
|
@mkowalski: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@mkowalski: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
| var config retryAllowedTestsConfig | ||
| if err := yaml.Unmarshal(retryAllowedTestsYAML, &config); err != nil { | ||
| logrus.WithError(err).Error("Failed to parse retry_allowed_tests.yaml, no tests will be allowed to retry") | ||
| return map[string]bool{} |
There was a problem hiding this comment.
This should've used the kube set type, FWIW
|
Sorry I only got around to looking at this now, logic looks correct to me 👍🏻 |
Summary
shouldRetryTest()with a single YAML allowlist (retry_allowed_tests.yaml) that is the sole source of truth for which tests may retry on failure.retry_statisticsdata, scoped to release 4.22, last 30 days, with a threshold of >= 5 flake occurrences. This yields 206 tests.Behavior
tests: [])How to allow retries for a test
Add the full test name to
pkg/test/ginkgo/retry_allowed_tests.yaml:The file is embedded at compile time (
go:embed), so changes require rebuildingopenshift-tests.How to regenerate the allowlist from BigQuery
The goal is to reduce this list over time as flaky tests are fixed, eventually reaching an empty list (no retries for anyone).
Summary by CodeRabbit